ABSTRACT

This first paper in a series describes the design of a study testing whether pre-appearance signa- 
tures of solar magnetic active regions were detectable using various tools of local helioseismology. The 
ultimate goal is to understand flux-emergence mechanisms by setting observational constraints on
pre-appearance subsurface changes, for comparison with results from simulation efforts. This first 
paper provides details of the data selection and preparation of the samples, each containing over 100 
members, of two populations: regions on the Sun that produced a numbered NOAA active region, 
and a control sample of areas that did not. The seismology is performed on data from the GONG net- 
work; accompanying magnetic data from SOHO/MDI are used for co-temporal analysis of the surface 
magnetic field. Samples are drawn from 2001 - 2007, and each target is analyzed for 27.7 hr prior to 
an objectively determined time of emergence. The results of two analysis approaches are published 
separately: one based on averages of the seismology- and magnetic-derived signals over the samples, 
another based on Discriminant Analysis of these signals, for a statistical test of detectable differ- 
ences between the two populations. We include here descriptions of a new potential-field calculation 
approach and the algorithm for matching sample distributions over multiple variables. We describe 
known sources of bias and the approaches used to mitigate them. We also describe unexpected bias 
sources uncovered during the course of the study and include a discussion of refinements that should 
be included in future work on this topic. 

Subject headings: Sun: helioseismology - Sun: interior - Sun: magnetic fields - Sun: oscillations 



1. INTRODUCTION 

We refer to the appearance of new solar active regions 
as "emergence", implying a rise from below the visible
photosphere. Yet the appearance and evolution of an 
active region from the surface through the corona is the 
symptom, the result - filtered through the τ = 1 boundary
and the transitions from high- to low-β plasmas - of
some (yet unknown) process happening below the visible 
surface. 

One general class of theories suggests that active re- 
gions form as the result of magnetic flux concentrations 
rising buoyantly from the base of the convection zone 
(for a review see Fan 2009). Another possibility is that 
sunspots are formed via coagulation of magnetic fields
generated closer to the solar surface (Brandenburg 2005
and references therein). The pre-emergence seismic signatures
expected from these two approaches differ substantially.
From the former scenario one should expect
signals generally taking the form of a bulk and quickly
moving disturbance whose internal plasma flow should
result in a signal detectable with today's tools (Birch
et al. 2010). In the latter case, the expectation would
likely be a slower change in the sub-surface temperature,
flow, and magnetic field environment over a less localized 
area. Simulations which focus on the dynamics of flux 
systems rising through the upper layers imply that slowly 
rising flux systems may impact the convection only
minimally (Stein et al. 2011), depending on the field strengths
involved. Still, simulations provide clues but are limited; 

1 NorthWest Research Associates, Boulder, CO 80301 USA 

2 National Solar Observatory, Tucson, AZ 85719 USA 

3 Max-Planck Institut fur Sonnensystemforschung, 37191 
Katlenburg-Lindau, Germany 



observations must continue to provide guidance. 

Being able to peer below the visible surface at the sub- 
surface structure and dynamics could provide the guid- 
ance regarding the formation mechanism for solar active 
regions ("AR"). Helioseismology seems to promise the 
ability to detect changes in the flow patterns and temperature
beneath the visible surface. From the pure physics
perspective, the tools of local helioseismology (Gizon &
Birch 2005; Gizon et al. 2010) should help determine the
subsurface dynamics associated with active region formation,
and thus could provide evidence for or against the
basic model types. Some preliminary work (described 
below) applying sensitive tools of this type to data-sets 
well suited for these techniques suggests that the capa- 
bility may now be available. 

Most recent efforts have been case studies, focusing
on the emergence of one or a few active regions (e.g.,
Jensen et al. 2001; Zharkov & Thompson 2008; Ilonidis
et al. 2011; Braun 2012). The results conflict when
taken as an ensemble, possibly due to the physics of
active region emergence, possibly due to the differences
between the studies
themselves. Inverting time-distance data from MDI using
three-dimensional kernels, Jensen et al. (2001) found
perturbations indicating wave-speed increases 20 Mm below
two active regions in the hours after their appearance.
Zharkov & Thompson (2008), using a very similar
method for two active regions, found a similar increase
when surface flux was visible, but also a "loop-like
structure" with decreased sound-speed, days prior to the
appearance of surface flux. Ilonidis et al. (2011) also
employ time-distance analysis of MDI data, and present
very large negative travel-time shifts (increases in the
sound speed) located at depths of 42-75 Mm up to two days
prior to surface flux appearance of four active regions. 
They associate these disturbances with magnetic structures
emerging at speeds of 0.3-0.6 km s$^{-1}$, and do see a
high rate of flux emergence following the perturbations.
Yet Braun (2012), using acoustic holography on the same
data for the same four active regions, detected no such
unique signals at the specified times and depths. Employing
ring-diagram analysis of GONG data for 13 new
or growing active regions (and contrasting with control
areas), Komm et al. (2008) found evidence for upflows
prior to the appearance of emerging flux at the surface,
followed by a transition to predominantly downflows once 
the active region was established. 

Ring-diagram analysis was also used in statistical studies
of seismic signatures associated with emerging magnetic
flux, comparing average signals for hundreds of regions
with increasing flux to either "quiet" areas or to
those with decreasing flux (Komm et al. 2009, 2011).
While the analysis had fairly low temporal and spatial
resolution, upflows were associated with emerging flux at
depths below 10 Mm whereas at shallower layers, upflows
changed to downflows as surface field became stronger. 
These studies examined the broad spectrum of surface- 
field behavior: growing flux, consistent flux, and decreas- 
ing flux. However, the "emerging flux" category did not 
differentiate between "new" active regions and emerging 
flux within already established regions. 

The conflicting results in case studies could indicate 
that there is no unique signature, or that results are 
sensitive to subtle methodology differences. The few 
published statistical studies have been based on a sin- 
gle method, and now need to be refined to focus solely 
on the pre-emergence context, and employ higher reso- 
lution analysis. 

In the present investigation we employ a combination 
of local helioseismology, surface magnetic field diagnos- 
tics and statistical tests to examine what can be learned 
with regard to sub-surface magnetic flux systems, their
structure, and their evolution. The basic premise of this
series of papers (this paper along with Birch et al. 2012;
Barnes et al. 2012) is to determine if there are detectable
changes in the solar interior that indicate an emerging ac- 
tive region prior to the appearance at the solar surface 
of a magnetic field concentration. 

We have designed and completed a study to examine 
the possibility of pre-emergence detection of active re- 
gions, with the goal of characterizing the sub-surface 
changes in the context of emerging-flux models. The 
approach pays attention to sources of bias, statistical 
and systematic error, and includes statistical validation 
of the results. The organization of this paper is as follows:
in § 2 we outline the physical parameters
within which the overall study must work, and the statistical
motivation for the overall design of our study.
We describe the data used and its treatment in § 3, and
in § 3.1 describe the target selection criteria, justification,
and implementation. We discuss sources of statistical
contamination in § 3. The most salient points
are synthesized in § 5 as groundwork for Birch et al.
(2012), where the helioseismic analysis is presented, and
for Barnes et al. (2012), where the statistical analysis of
the helioseismic and magnetic data is presented.



2. STUDY DESIGN 

The goal of this study is to determine whether there 
exists a pre-emergence signature of solar active regions 
visible using local helioseismic methods, and to understand
said signal, if it exists, in the context of active-region 
formation theory. As summarized above, case studies 
have led to conflicting results. We have designed a study 
that utilizes appropriate statistical tests applied to data 
which include "control" samples. Such a study requires 
two basic things: sufficient samples of both "event" data 
and a control set, and care in selecting both samples so 
as to minimize bias. 

Fortunately, sufficient data are now available
to perform such a study, including a statistical analysis
of the results. The statistical method we use in
Barnes et al. (2012) is discriminant analysis (e.g., Kendall
et al. 1983), a technique that tests for any difference between
the two samples. As such, any systematic bias that
is present in the sampling from one population but ab- 
sent in the other may appear as a false discriminant. For 
example, if all samples for one population were obtained 
from east of central meridian while all samples for the 
other were obtained from west of central meridian, then 
the samples could be differentiated simply due to a bias 
in the Doppler signal from solar rotation, not a true de- 
tection of emergence. We refer to this bias as statistical 
contamination. 
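
The effect of statistical contamination can be illustrated with a minimal sketch. It assumes a single Gaussian-distributed variable, equal priors, and a hypothetical constant offset standing in for a position-dependent Doppler bias; the function names are illustrative only and this is not the discriminant analysis used in Barnes et al. (2012).

```python
import random
import statistics


def simulate_samples(n, offset, seed):
    """Draw two samples from the SAME population, then add a systematic
    offset (a hypothetical sampling bias) to the second sample only."""
    rng = random.Random(seed)
    sample_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sample_b = [rng.gauss(0.0, 1.0) + offset for _ in range(n)]
    return sample_a, sample_b


def discriminant_success_rate(sample_a, sample_b):
    """One-variable linear discriminant, assuming equal priors and equal
    variances: classify each point by the midpoint of the sample means
    and return the fraction classified correctly."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    threshold = 0.5 * (mean_a + mean_b)
    sign = 1.0 if mean_b >= mean_a else -1.0
    correct = sum(1 for x in sample_a if sign * (x - threshold) < 0)
    correct += sum(1 for x in sample_b if sign * (x - threshold) >= 0)
    return correct / (len(sample_a) + len(sample_b))
```

With no offset the success rate stays near one half, as expected for indistinguishable populations; with an offset of a few standard deviations the two samples separate almost completely even though nothing physical distinguishes them.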

The basic data comprise time-series of Doppler velocity 
obtained at the solar surface, from which shifts in subsurface
travel times are derived using helioseismic holography
(Lindsey & Braun 2000; Braun et al. 2007). Obtaining
a reliable seismic signature requires a temporal
sequence of data, the length of which will govern the 
signal-to-noise ratios of the inferred subsurface patterns; 
yet the data quality may degrade with proximity to the 
solar limb. These realities create limits on the observable 
solar disk available for drawing the samples. 

In the case of analysis using helioseismology, bias may 
take many forms. Due to the global frequency shifts
with solar cycle (Woodard & Noyes 1985;
Christensen-Dalsgaard 2002; Chaplin et al. 2007) the Doppler velocity
signals may have a component distinctly linked
to the date. Systematic effects (Braun & Birch 2008;
Zhao et al. 2012; Baldner & Schou 2012) may create a
dependence of the helioseismology results on apparent
disk position. Active regions emerge within a fairly nar- 
row latitude range which itself shifts with the phase of 
the activity cycle, leading to another potential source of 
bias. 

To allow an unambiguous detection of subsurface sig- 
nals, the emergence episodes should be isolated in time 
and space from other strong magnetic sources and nearby 
emergence episodes. Yet active regions often emerge 
in close proximity to already-established active regions
or remnant fields (Petrovay & Abuzeid 1991; Harvey &
Zwaan 1993; Pojoga & Cudnik 2002). The controls must
ideally also have no magnetic emergence occurring, and 
minimal strong-field regions within the immediate field of 
view, but they must also match the magnetic context of 
the population of emerging targets, as the solar disk gets 
crowded with active regions and their remnants during 
the solar maximum years. 
Thus, it is key to couple observations of the solar surface
magnetic field and its evolution to the selection and
characterization of the seismology data. Pairing the mag- 
netic data to the seismic data provides guidance for in- 
terpreting any seismic signature observed, both in the 
control and event groups. 
The study is designed based on the following steps: 

1. Locate and identify a statistically significant sam- 
ple of the population of new active region appear- 
ances, according to constraints imposed to mini- 
mize bias and noise. 

2. Locate and identify a sample of the emergence-free
population, matched in time and position to the 
pre-emergence sample, to serve as a control. 

3. Apply helioseismic data analysis "blindly" to the 
two samples. 

4. Parametrize the results from the helioseismic anal- 
ysis and magnetic field data. 

5. Apply Discriminant Analysis to the seismic and 
magnetic parameters to quantify the differences be- 
tween the two samples. 

3. DATA 

A study such as this requires a statistically significant 
sample drawn from the populations in question. Limitations
imposed by observational and statistical constraints,
described in detail below, thus pointed to using
data from the Global Oscillations Network Group
("GONG"), from the era after the camera upgrades (beginning
in 2001; Harvey et al. 1998; Hill et al. 2003).
The GONG system records wavelength-modulated full-disk
images sampled at 2.5" for 5" optical resolution,
from which Doppler signals are retrieved on a 1-minute 
cadence. 

Key to interpreting any detected seismic signature is 
knowing the "landscape" of the surface magnetic field. 
As we are specifically interested in pre-emergence signa- 
tures, the surface magnetic fields and the signature of 
magnetic flux emergence define the timing for the entire 
project. At the time of design and implementation of 
this study, the line-of-sight field data from GONG
were not readily available. We thus rely upon the full-disk
line-of-sight component magnetic field data from the
Michelson Doppler Imager aboard the Solar and Heliospheric
Observatory (SOHO/MDI; Scherrer et al. 1995).
Specifically, we used the level 1.8.2 synoptic data$^{4}$ acquired
with a 96-minute cadence and 1.98" pixel size
to qualitatively and quantitatively evaluate the magnetic 
landscape of the samples. 

Helioseismic data from MDI were not used in this 
study for two reasons. First, the high-rate full disk 
data ( "dynamic campaigns" ) are only available for a few 
months per year, limiting the data available for a statistical
study. Second, the medium-ℓ ("structure") data
are not optimal for studying wave propagation at distances
less than approximately ten heliocentric degrees
(Giles 2000), whereas the present work examines depths
< 25 Mm, which requires small distances. For these reasons,
we have used the GONG data for the helioseismic
analysis performed in this study.

4 Emergence times were initially determined using earlier level
1.8.1 data, but we do not expect any systematic differences as the
emergence times were based on the change of the signal, not a
pre-determined threshold.

3.1. Target Selection Criteria: The "PE"s:
Pre-Emergence Regions

The initial target list for emerging active regions was 
derived from the "Sunspot Group Reports" produced by 
USAF/NOAA and available through the National Geophysical
Data Center.$^{5}$ The date-range used was chosen
according to requirements for the helioseismology data,
and covered July 2001 - November 2007. Regions listed
as first appearing within θ < 30° of disk center and which
achieved an area > 10 x 10$^{-6}$ hemispheres (μH) during
their disk passage determined the initial target list, and
the initial emergence times and locations, that were sub- 
sequently refined. 

MDI 3-day time-series were constructed centered on 
this initial emergence date and time, using a fixed 
128 x 100 pixel box centered on the initial emergence
location (see Figure 1 for a schematic), and tracked
with the synodic rotation rate (Figure 2). As a check
against extreme viewing angles at the beginning or end
of the time-series, additional limits on the edges of the
box were placed at E41 and W67 heliographic longitude
(East longitudes are < 0) and ±60° heliographic latitude.
The $B_{\rm los}$ data were initially summed to a pseudo-"flux",
$\Phi_{\rm los} = \sum |B_{\rm los}| / \mu \, \Delta A$, where $\mu = \cos\theta$ and $\theta$ is
the observing angle, and $\Delta A$ is the physical area of a
pixel. A refined emergence time, $t_0$, was defined as the
time of the first MDI observation after $\Phi_{\rm los}$ reached 10%
of the maximum achieved (minus any flux present at the
beginning of the time-series) over the time series. That
is, the "10% rule" refers to 10% of the maximum increase
detected. The kurtosis (fourth moment) of the distribution
of $B_{\rm los}$ in the frame generally increases dramatically
at the time of emergence, signifying a distinct change
in the spatial distribution of $B_{\rm los}$; a sudden change in
the kurtosis was used to confirm the "10% rule" but was 
not relied upon in isolation. Thus, the emergence time 
is only defined within the 96-minute MDI cadence. For 
the analysis methods later applied, which require many 
hours of data, there is little to be gained by refining this 
definition further. The NOAA reports of active region 
coordinates were generally accurate, although our definition
of $t_0$ was generally earlier than the NOAA reports
by anywhere from a few hours up to a day. 
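
The "10% rule" and the kurtosis check can be sketched as follows. This is a minimal illustration, assuming flattened per-pixel lists, a flux time series sampled at the MDI cadence, and that the first sample at or above the threshold is the one selected; the function names are hypothetical and do not come from the study's software.

```python
def pseudo_flux(b_los, mu, pixel_area):
    """Pseudo-flux for one frame: sum of |B_los| / mu over pixels, times
    the (assumed uniform) physical pixel area."""
    return sum(abs(b) / m for b, m in zip(b_los, mu)) * pixel_area


def emergence_index(flux_series, fraction=0.10):
    """Index of the first sample at which the increase over the initial
    flux reaches `fraction` of the maximum increase in the series.
    Returns None when the flux never rises above its initial value."""
    baseline = flux_series[0]
    max_increase = max(flux_series) - baseline
    if max_increase <= 0:
        return None
    threshold = baseline + fraction * max_increase
    for i, value in enumerate(flux_series):
        if value >= threshold:
            return i
    return None


def kurtosis(values):
    """Fourth standardized moment of a list of values (NaN if the
    variance is zero)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else float("nan")
```

Subtracting the initial flux before applying the threshold is what makes the rule insensitive to pre-existing field in the box; a jump in `kurtosis` between consecutive frames would then serve only as a cross-check on the selected index.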

Emergence of surface field is rarely a smoothly monotonic
process (Zwaan 1985; Leka et al. 1994; Kubo et al.
2003). An example of that reality is shown in Figure 3
(and discussed further in Section 4.3). As such, the flux
history and thresholds here constitute a selection rule
to be used for a statistical approach, rather than a pro- 
found statement of solar physics. And as such, there will 
be regions for which the definition blatantly misses the 
mark of rising flux presence. The goal here is a well- 
defined "good option", that is objective and repeatable 
for a statistically-significant sample of data. 

Regions were rejected for a number of reasons, primarily
data-gaps (in either MDI at or near the emergence
time, or GONG data for final analysis) or immediate
proximity (within the 128 x 100 pixel box) of another active
region. No further tests were made concerning the
eventual size of the active region or speed of emergence; a
later subjective evaluation rejected regions if $t_0$ appeared
incorrect by more than a few MDI-derived data points.
The fixed box used at this stage was fairly restrictive.

5 http://www.ngdc.noaa.gov/stp/solar/sunspotregionsdata.html

The refined location and time of emergence, defined as
above, were used to generate the Doppler-velocity data
(see § 3.4). The final result is 107 pre-emergence ("PE")
target regions between 2001 and 2007. In Table 1 we
list the identifying features of these regions: the NOAA
Active Region number, the $t_0$ as defined above, and the
latitude and longitude of the center of the 128 x 100 pixel 
box at that time. Note that the longitude was generally 
refined from the NOAA reports, while the latitude gen- 
erally was not, and as such is effectively an integer. In 
Figure 4 we show the final distribution of the (eventual)
maximum size achieved (as reported in the NOAA com- 
pilations) for the active regions in the PE list. 

A subset of eleven regions is singled out as being
particularly "clean", and these are indicated with a superscript
"a" in Table 1. The criteria for this list are
completely subjective: no neighboring active region in
the extracted areas, a very flat pre-emergence flux history,
and an emergence characterized by a very uniform
and steep slope of $d\Phi_{\rm los}/dt$. The example shown in Figure 2
is one such member of the "Ultra-Clean Subset".

3.2. Target Selection Criteria: The "NE"s: 
No-Emergence Control Regions 

The active-region emergence targets required an ac- 
companying set of "control data". As our final analy- 
sis is a statistical analysis based on the results of both 
helioseismology-derived and magnetic-derived parame- 
ters, the control data needed to be constructed so as to 
not introduce statistical bias into the final distributions. 
We outline the construction of this data set here. 

Starting every two MDI days during the same 2001- 
2007 interval, using the same-sized 128 x 100-pixel 
tracked boxes, areas were identified where the underlying
signal stayed consistently < 1000 G.$^{6}$ This was accomplished
by "stacking" three days' worth of MDI data and
extending the target box to effectively cover the tracked
area, as shown in Figure 5. Random locations for these
low-field areas were chosen on the disk for each stack,
subject to the same general constraints as the PE targets
with regard to limits on latitude and longitude. A
time close to the center of the 3-day interval, falling on
an MDI observed time, is designated $t_0$ for the NE data.
While there was the possibility of overlapping areas being
chosen, any randomly-selected NE patch which did overlap
was "weeded out" as described below. An example
of a "no-emergence" region is shown in Figure 6.
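
The stacking test for low-field candidate areas can be sketched as below. This is a simplified illustration that assumes frames already co-registered on a common pixel grid (so tracking and the box extension for rotation are ignored) and uses nested lists for images; the function names are hypothetical.

```python
import random


def stack_max_abs(frames):
    """Pixel-wise maximum of |B| over a list of 2-D frames (lists of rows)."""
    n_rows, n_cols = len(frames[0]), len(frames[0][0])
    return [[max(abs(f[r][c]) for f in frames) for c in range(n_cols)]
            for r in range(n_rows)]


def box_is_quiet(stacked, row0, col0, height, width, limit=1000.0):
    """True if every pixel of the box stays below `limit` (in G)."""
    return all(stacked[r][c] < limit
               for r in range(row0, row0 + height)
               for c in range(col0, col0 + width))


def pick_quiet_boxes(stacked, height, width, n_trials, seed, limit=1000.0):
    """Try `n_trials` random box positions; keep those that are quiet."""
    rng = random.Random(seed)
    n_rows, n_cols = len(stacked), len(stacked[0])
    picks = []
    for _ in range(n_trials):
        row0 = rng.randint(0, n_rows - height)
        col0 = rng.randint(0, n_cols - width)
        if box_is_quiet(stacked, row0, col0, height, width, limit):
            picks.append((row0, col0))
    return picks
```

Taking the maximum over the stack, rather than a mean, ensures that a box is rejected if the signal exceeds the limit at any time during the three days.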

This selection algorithm initially provided thousands
of possible NE targets over the seven years. A subset
of approximately 500, selected to generally follow the
same distribution in latitude, longitude, and time as the initial
set of PE targets, was used to acquire GONG data (see
§ 3.4). From these, a subjective evaluation was made, removing
approximately 20 targets from consideration primarily
due to the existence of small (obviously, unnumbered)
emerging flux regions at the center of the
field-of-view which were not previously detected. While
no specific criteria were used regarding increasing or
changing total flux over the time interval, the single criterion
specified above effectively constrained selection
to regions with impressively consistent magnetic
flux levels, on the whole.

Candidate NE regions were further evaluated and removed
if the central 16° x 16° (used for the majority of
the helioseismology analysis, see Section 3.4) overlapped
with the central 16° x 16° portion of a PE or another NE
at any time. The final number of NE controls available
for distribution control (see § 3.3 below) was 308.

6 Gauss are used as units, with the understanding it is a pixel-
averaged quantity.

3.3. Distribution Control

An algorithm was developed for post-facto selection
from a larger sample of controls (NE) to match the distribution
of the targets (PE) simultaneously in latitude,
longitude, and time. A non-parametric density estimate
(NPDE; e.g., Silverman 1986), using the Epanechnikov
kernel and the optimal smoothing parameter for a normal
distribution, was used to estimate the probability
density function for the three variables on a regular grid
in longitude, latitude, and ln(time) (see Figure 7). The
non-parametric approach was used to avoid misrepresenting
non-Gaussian distributions such as the latitude
of emergence (which is decidedly and expectedly double-peaked);
similarly, the logarithm of the time variable$^{7}$
was used to compensate for its extremely skewed distribution.
A simulated annealing algorithm (e.g., Press
et al. 1992; Metropolis et al. 1953; Kirkpatrick et al.
1983) was employed to select the subset of NE of a specified
size (equal to the number of PE) that minimizes the
integrated absolute value of the difference between the
two NPDEs (NE and PE). Using the integral preserves
the general shapes of the distribution rather than (for
example) employing a peak or maximum difference as a
Kolmogorov-Smirnov test would do.

The results of this matching exercise are shown for the
three variables in Figure 7. A table listing coordinates
for the final NE targets is provided in Table ??, where we
list the MDI orbits generally containing the NE region,
the mid-point of the GONG day used for analysis (see
Section 3.4 below), and the coordinates of that mid-point
(note we do not list $t_0$ precisely, but it is fairly
inconsequential).

The equal sample sizes of PE targets and NE controls 
impose a specific requirement on the statistical tests: the
prior probabilities, of which type of event (PE or NE) is
more or less frequent, are set to be equal. This statistical
requirement is maintained even after a further restriction
is placed on the data for acceptable GONG duty-cycle
(see Section 3.4 and Table 3), which in fact creates small
inequities in the sample sizes. With equal prior probabilities,
the goal of determining whether these populations
differ is emphasized. Were this a test of prediction, the 
sample sizes (hence prior probabilities) should reflect the 
chances of any random place on the Sun being a location 
and time of emergence; clearly this is a ratio of many 
thousands to one. 
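
The distribution-matching step of this section can be sketched in one dimension. This is a minimal illustration only: the study matched three variables simultaneously, whereas this sketch uses a single variable, a user-supplied bandwidth, and a simple linear cooling schedule, all of which are assumptions; the function names are hypothetical.

```python
import math
import random


def epanechnikov_density(samples, grid, bandwidth):
    """Kernel density estimate on `grid` using the Epanechnikov kernel."""
    norm = 1.0 / (len(samples) * bandwidth)
    density = []
    for g in grid:
        total = 0.0
        for s in samples:
            u = (g - s) / bandwidth
            if abs(u) < 1.0:
                total += 0.75 * (1.0 - u * u)
        density.append(norm * total)
    return density


def mismatch(subset, targets, grid, bandwidth):
    """Integrated absolute difference of the two density estimates,
    evaluated on a regular grid."""
    step = grid[1] - grid[0]
    d_sub = epanechnikov_density(subset, grid, bandwidth)
    d_tar = epanechnikov_density(targets, grid, bandwidth)
    return step * sum(abs(a - b) for a, b in zip(d_sub, d_tar))


def match_controls(controls, targets, grid, bandwidth,
                   n_steps=2000, t_start=0.05, seed=0):
    """Simulated annealing over subsets of `controls` with the same size
    as `targets`; requires more controls than targets. Returns the sorted
    indices of the best subset found and its mismatch."""
    rng = random.Random(seed)
    indices = list(range(len(controls)))
    rng.shuffle(indices)
    chosen, spare = indices[:len(targets)], indices[len(targets):]
    cost = mismatch([controls[k] for k in chosen], targets, grid, bandwidth)
    best, best_cost = list(chosen), cost
    for step in range(n_steps):
        temperature = t_start * (1.0 - step / n_steps) + 1e-9
        i, j = rng.randrange(len(chosen)), rng.randrange(len(spare))
        chosen[i], spare[j] = spare[j], chosen[i]      # propose a swap
        new_cost = mismatch([controls[k] for k in chosen],
                            targets, grid, bandwidth)
        accept = (new_cost <= cost or
                  rng.random() < math.exp((cost - new_cost) / temperature))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(chosen), cost
        else:
            chosen[i], spare[j] = spare[j], chosen[i]  # revert the swap
    return sorted(best), best_cost
```

Because the cost is an integral over the whole grid, a swap is rewarded for improving the overall shape of the control distribution rather than for reducing only its largest local discrepancy.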

7 Specifically, the logarithm of the number of Julian days since 
2001 July 25, two days before the first dataset. 

3.4. Preparing the Doppler Velocity Cubes

After the appropriate target selection, there is no dif- 
ference in the treatment of the PE and NE data-cubes 
produced from the GONG Doppler velocity data. Cubes 
32° x 32° in extent were tracked at the Carrington rate, 
and extracted from the GONG 1-minute velocity data
(Corbard et al. 2003). As indicated in Figures 1 and 8,
this extracted area is larger than the original 128 x 100
MDI-pixel area used for initial evaluation. 

The final cubes used for this analysis are one "GONG-day"
long (1664 min.); for the PE data, the cubes end
16 minutes after the emergence time $t_0$ due to a small
communication error; given the temporal sampling of 
the magnetic field data, we do not assign significance to 
the 16 minutes aside from assuming there will be early 
emergence magnetic flux appearing near the end of the 
GONG-day. 

The extracted Doppler-velocity data are re-projected
using a Postel projection (Pearson 1990). The 1664-minute
timeseries are then broken into five time intervals,
each 384 minutes long but starting every 320 minutes
(thus an overlap of 64 minutes between each interval).
A schematic of the data and the temporal relationship
between time intervals is shown in Figure 9.
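
The interval arithmetic above can be checked with a short sketch; the function name and the half-open (start, end) convention in minutes are assumptions made for illustration.

```python
def interval_bounds(total_minutes=1664, length=384, stride=320):
    """Half-open (start, end) bounds, in minutes from the start of the
    cube, of the overlapping analysis intervals."""
    starts = range(0, total_minutes - length + 1, stride)
    return [(start, start + length) for start in starts]
```

With the default values this yields five intervals, the last ending at minute 1664, with 64 minutes shared between consecutive intervals.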

The GONG facility includes different observing sites 
whose data are combined to create full temporal cover- 
age. While the average duty cycle for GONG data is 
very high, at times the coverage falters for a variety of 
reasons. Intervals which fall below a duty cycle of 80% 
are not included in the analysis. This restriction removes 
data randomly; there is no reason for duty cycle to be 
tied to PEs preferentially over NEs, especially after the 
matching was performed for location and date. In addi- 
tion, what are removed from consideration are individual 
intervals rather than an entire PE or NE target. Table 3
presents the resulting sample sizes for PE and NE popu- 
lations by interval, after removing data with insufficient 
duty cycle. 
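
The duty-cycle restriction can be sketched as follows, assuming one boolean flag per minute marking whether a usable GONG image exists; the flag representation and function names are hypothetical.

```python
def duty_cycle(valid_flags):
    """Fraction of time samples with usable data (0.0 for an empty list)."""
    if not valid_flags:
        return 0.0
    return sum(1 for flag in valid_flags if flag) / len(valid_flags)


def usable_intervals(valid_flags, bounds, minimum=0.80):
    """Indices of the intervals whose duty cycle is at least `minimum`.
    `bounds` holds half-open (start, end) sample indices per interval."""
    return [i for i, (start, end) in enumerate(bounds)
            if duty_cycle(valid_flags[start:end]) >= minimum]
```

Intervals are tested one at a time, so a gap early in the GONG-day removes only the intervals it touches and leaves the rest of that target in the sample.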

3.5. The Accompanying Magnetic Data for Analysis 

In addition to considering each event (or lack
thereof) as viewed by helioseismology, to confirm that
the results arise from subsurface processes, we produced
a complementary data set of the surface field. For
analysis we attempt to mitigate projection effects present 
due to the fact that the MDI data detect only the line- 
of-sight component of the flux density (explained in de- 
tail below). We also want to match the measure of the 
surface magnetic field to the area and projection used 
with the GONG Doppler-velocity data cubes. To achieve 
this, first the location and 32° x 32° spatial extent of the 
GONG cubes were identified in MDI data covering the 
same time interval. 

To minimize projection effects and, more adroitly, use 
the most physically meaningful magnetic measure avail- 
able from the MDI data, we use a potential-field calcu- 
lation to retrieve an estimate of the radial component 
of the field. Specifically, the potential field was calculated
using the observed line-of-sight component as the boundary
(Sakurai 1982; Bogdan 1986; Rudenko 2001), rather
than assuming the boundary was equivalent to the radial 
component of the field. 

In general, the radial component of a potential field
(without a source surface) in the volume above the solar
surface can be expressed in a spherical harmonic expansion as

where the $P_l^m$ are the associated Legendre functions, $R_\odot$
is the solar radius, $r$ is distance from the center of the
Sun, θ is co-latitude, and φ is longitude (measured for
any choice of the polar axis). Following the approach
of Rudenko (2001) by taking the polar axis of the coordinate
system to lie along the line of sight, and using
relationships among the associated Legendre functions as
done by Bogdan (1986), the coefficients can be written
as

where $B_l(R_\odot, \theta, \phi)$ is the line of sight component of the
field at the solar surface. To ensure that the monopole
term vanishes in the sum, we further assumed that the
field on the far side of the Sun was given by
$B_l(R_\odot, \pi - \theta, \phi) = B_l(R_\odot, \theta, \phi)$, where the front side of the Sun is assumed
to lie in the range $0 < \theta < \pi/2$. This can produce
some unphysical results very close to the limb, but does
not greatly affect the field at the surface in the restricted
area of the disk considered in this investigation. The integrals
were evaluated using a simple trapezoid method,
and the spherical harmonics were computed using the
freely available software archive SHTOOLS,$^{8}$ which has
a relative error of less than $10^{-5}$ up to degrees of at least
2600. However, only terms up to degree of 1000 were
included, as this is sufficient to reconstruct spatial scales
on the order of the resolution of MDI. Note also that the
acoustic modes in the GONG data are seen up to about
ℓ = 1000 (see Figure 1 from Birch et al. 2012).
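
For orientation, a commonly used form of the radial component of a potential field without a source surface is sketched here. The coefficient names $g_l^m$ and $h_l^m$, the summation limits, and the normalization of the $P_l^m$ are assumptions made for illustration and may differ from the convention adopted in this paper; the expression for the coefficients in terms of the line-of-sight field is not reproduced.

```latex
B_r(r,\theta,\phi) = \sum_{l=1}^{\infty} \sum_{m=0}^{l}
  (l+1) \left( \frac{R_\odot}{r} \right)^{l+2} P_l^m(\cos\theta)
  \left[ g_l^m \cos(m\phi) + h_l^m \sin(m\phi) \right]
```

Each term decays as $r^{-(l+2)}$ above the surface, and starting the sum at $l = 1$ corresponds to the vanishing monopole term discussed above.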



The above calculations were performed on an extracted 
cube slightly larger than 32° x 32°, then the potential 
field radial component was subjected to Postel-projection 
and trimmed to exactly match the GONG datacubes. 
Hence, we have for each PE and NE data set, a time- 
series of the radial component of the field matched in 
area, and matched in projection, to sub-surface obser- 
vations made by helioseismology, albeit the latter by a 
different instrument. 
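
The Postel (azimuthal equidistant) projection applied to both the Doppler and the magnetic data can be sketched with the standard forward mapping. This is a generic illustration, not the study's remapping code; angles are in radians, the function name is hypothetical, and the antipode of the projection center is not handled.

```python
import math


def postel_project(lat, lon, lat0, lon0):
    """Azimuthal equidistant (Postel) forward projection about the center
    (lat0, lon0). Returns (x, y) in radians of arc, so that the distance
    from the origin equals the great-circle distance from the center."""
    dlon = lon - lon0
    cos_c = (math.sin(lat0) * math.sin(lat)
             + math.cos(lat0) * math.cos(lat) * math.cos(dlon))
    cos_c = max(-1.0, min(1.0, cos_c))   # guard against rounding
    c = math.acos(cos_c)                 # angular distance from center
    k = 1.0 if c < 1e-12 else c / math.sin(c)
    x = k * math.cos(lat) * math.sin(dlon)
    y = k * (math.cos(lat0) * math.sin(lat)
             - math.sin(lat0) * math.cos(lat) * math.cos(dlon))
    return x, y
```

Great-circle distances from the map center are preserved, so a given radius in the projected data corresponds to the same angular distance on the Sun in every direction from the target location.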

From these maps, an appropriate time-series of the 
history of the field at the target and its immediate sur- 
roundings is computed, for comparison with the results of 



available at http : //www . ipgp . f r /~wieczor/SHT00LS/SHT00LS . html 
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helioseismology. Sample pairs of average radial field den- 
sity and average corresponding Doppler data are shown 
in Figure [H] for the PE and NE examples of Figures [2] [6j 
For these accompanying magnetic data, we show in
Figure 10 the unsigned radial field averaged over all
samples, for each of the time intervals used for the
seismology analysis. To provide context, we extend this slightly
in time and show the averages for two additional
post-emergence time intervals. Of note is the distinct lack of
variation in the NE data, but also the noticeable bands
of stronger signal at the top and bottom of the NE data
cubes compared to the central portion. The PE data
show a distinct early signature of surface field 24 hr prior
to the emergence time, and a clear bipolar signature
after emergence is underway. In the "ultra-clean" subset of
PE data the bipolar structure is less clear but arguably
present, whereas the early signature is arguably completely
absent when only the cleanest, "most virgin" examples are
chosen. At the same time, averaging over a smaller number
for the "ultra-clean" dataset allows a single sample to
influence the average: the strong persistent signal on the
right-hand portion of the "ultra-clean" mean in Figure 10 is
primarily due to a strong plage area near NOAA AR 9645.
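The layout of the time intervals used for these averages (five
intervals of 384 minutes, starting every 320 minutes within a
1664-minute GONG series, as given in the caption of Figure 9) can
be tabulated with a short sketch. The function and variable names
are illustrative assumptions; only the numbers come from the
caption.

```python
def interval_schedule(n_intervals=5, start_step=320, length=384, total=1664):
    """Return (start, end, centre_hr) for each time interval.

    start and end are in minutes from the beginning of the GONG
    time series; centre_hr is the central time of the interval in
    hours relative to the end of the series.
    """
    schedule = []
    for k in range(n_intervals):
        start = k * start_step
        end = start + length
        centre_hr = (start + length / 2.0 - total) / 60.0
        schedule.append((start, end, centre_hr))
    return schedule


if __name__ == "__main__":
    for start, end, centre_hr in interval_schedule():
        print(f"{start:5d} {end:5d} {centre_hr:7.2f} hr")
```

With these numbers the last interval ends exactly at minute 1664,
neighboring intervals overlap by 64 minutes, and the series spans
about 27.7 hr.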

3.6. Further Corrections 

The latitude reported by NOAA was generally unchanged
for extracting the GONG data cubes; the longitude
was updated according to $t_0$. For the
later analysis, especially the averages over all samples
used in Birch et al. (2012), the coordinates were refined
in the following manner. The time-series of the radial
magnetic field were used to construct bitmaps of new
flux using the difference ($|\delta B|$) between the field roughly
12 hr after $t_0$ and the first time interval (roughly 24 hr
before $t_0$), including in the bitmap only those areas where
$|\delta B| > 0.3 \times \max(|\delta B|)$. A centroid was computed from the
resulting bitmap, and the coordinates were then assigned
to be the location of this centroid.
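The refinement step can be sketched as follows. This is a minimal
illustration under stated assumptions: the two input maps are 2-D
arrays on the same grid, the centroid is the unweighted mean of the
bitmap pixel positions, and the result is in pixel coordinates (the
conversion to heliographic coordinates is not shown). The function
name is hypothetical.

```python
import numpy as np


def new_flux_centroid(b_early, b_late, frac=0.3):
    """Centroid (row, col) of the bitmap where |dB| > frac * max(|dB|).

    b_early: radial-field map from the first time interval
    b_late:  radial-field map roughly 12 hr after the emergence time
    Returns None when the two maps are identical (no new flux).
    """
    delta = np.abs(np.asarray(b_late, float) - np.asarray(b_early, float))
    peak = delta.max()
    if peak == 0.0:
        return None
    bitmap = delta > frac * peak
    rows, cols = np.nonzero(bitmap)
    # Unweighted centroid of the bitmap (assumption: every pixel
    # above the threshold counts equally).
    return float(rows.mean()), float(cols.mean())
```

A flux-weighted centroid would be an equally plausible reading of
the text; the unweighted form is used here only because the
description refers to a bitmap.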

In this manner, the analysis which is performed on av- 
erages taken over space, time, or sample, will provide 
results that are not diluted by subtle differences in emer- 
gence location within the field of view. Accordingly, no 
similar refinement was performed for the NE samples, as 
there are no events by which to define such a refinement. 

4. STATISTICAL CONTAMINATION ISSUES 

This is a study trying to detect a small difference be- 
tween two populations. How these populations are de- 
fined and the samples obtained, then, will directly affect 
the reliability of the results. The goal is that the PE 
regions be clear, distinct, isolated, fairly near disk-center 
emergence episodes, and the NE regions be emergence- 
free episodes matched to the PE distributions in loca- 
tion and time (effectively, solar-cycle activity level) as 
described above. 

Statistical contamination, the existence of a bias that
will inadvertently identify the two populations without
being directly related to the emergence process, may take
a variety of forms. As alluded to in Section 2, we describe
below our understanding of the various contributions to
possible contamination, and our efforts to mitigate them.

4.1. Nearby Field 



Ideally, the background, nearby, or pre-existing field in
the NE targets (its distribution in space, flux density,
total flux, etc.) is indistinguishable from that of the
PE targets prior to emergence. The emergence episodes
and non-emergence regions were initially characterized
by 128 × 100-pixel tracked boxes in the MDI image-plane
coordinate frame. The data cubes used for analysis were,
as described above, 32° × 32° on a heliographic grid. The
difference between these two can be seen in Figure 8 and
is not insignificant. The most noticeable effect is that
the NE cubes in fact often contain stronger field at the
periphery than the cut-off used to select the smaller areas
(Figure 10). The PE cubes were isolated from nearby
active regions in the original 128 × 100-pixel evaluation,
but strong field (active regions) can be found in the larger
32° × 32° field of view.

By comparing the signals averaged as shown in
Figure 11, it is clear that there is a bias: the median
magnetic-field signal is larger in the PE samples than
in the NE samples. Note that by showing the median,
rather than the mean, the results are not influenced by
outliers, and the distributions display a real difference.
For the full 32° × 32° field of view, the difference is
significant but not large; when considering only the smaller
central 16° × 16°, the PE sample result does not change
noticeably (until emergence begins in the last time
interval), whereas the NE sample median signal is considerably
reduced. This confirms that the initial 128 × 100-pixel
evaluation area for the NE sample is "too quiet"
compared to the enhanced signal in the NE sample
peripheries and to the PE sample, even though the selection
threshold was a generous 1 kG (Section 3.2).
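The robustness of the median to outliers, and a bootstrap estimate
of its uncertainty (the caption of Figure 11 mentions a bootstrap
method without giving details), can be illustrated with a short
sketch. The number of resamples, the seed, and the function name
are assumptions made for this example.

```python
import numpy as np


def bootstrap_median_error(values, n_resamples=2000, seed=0):
    """Return (median, bootstrap standard error of the median).

    Resamples `values` with replacement n_resamples times and takes
    the standard deviation of the resampled medians.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    medians = np.median(values[idx], axis=1)
    return float(np.median(values)), float(medians.std(ddof=1))


if __name__ == "__main__":
    # Hypothetical area-averaged unsigned field values (G) with one
    # sample dominated by a nearby active region.
    sample = [10.0, 11.0, 12.0, 13.0, 1000.0]
    print("mean:", np.mean(sample))
    print("median, error:", bootstrap_median_error(sample))
```

In this toy sample the single large value dominates the mean but
leaves the median unchanged, which is the property relied upon in
the comparison above.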

This bias may have been introduced by the selection, or it may
be a real effect. The emergence really could start more
than a day before $t_0$, in which case there is no error, just
a real physical effect only visible in the ensemble. However,
by imposing a field strength limit on the NEs but
not on the PEs, we may have introduced an artificial bias into
the samples. Due to the matching in latitude and
longitude, there should be no gross preferential prevalence
of "background" field as there would be had all of the
NE regions, for example, been selected outside the active
latitudes or all in the same hemisphere. However, there
was also no de-selection of PE candidates based on
"active longitudes" (Petrovay & Abuzeid 1991; Gaizauskas
et al. 1994; Pojoga & Cudnik 2005), and active longitude
lifetimes are likely too short to be captured simultaneously
in the time-matching and longitude-matching. If
the bias is the effect of active longitudes, this is a real
(solar) bias towards having pre-existing field for the PEs.
The fact that the NEs are "too quiet" implies that the
inconsistent use of a threshold contributes to the bias,
but it may not be the only effect. The significance of this
systematic difference between the samples is discussed in
detail in Paper III.

4.2. Nearby or Short-Lived Emergence 

It is conceivable that nearby or on-going short-lived 
flux emergence may contaminate the seismology signal 
we search for. 

No screening or diagnostics were performed to specifically
rule out nearby emergence episodes for either PE or
NE samples. Both may have emerging flux regions in the
periphery, and these datasets were not removed from
consideration (as long as there were no emerging flux regions
within the central 16° × 16°, or ≈ 100 Mm × 100 Mm). A
variety of seismic analyses will be performed with varying
pupil sizes (Birch et al. 2012), thus the influence of
field in the sample peripheries can in fact be studied.

Very small-scale, short-lived emergence episodes, or
"ephemeral regions", are ubiquitous and bring substantial
flux to the solar surface (Harvey & Zwaan 1993;
Hagenaar et al. 2003). The presence of ephemeral regions
is not selected for or against, as their peak field
strengths generally fall below the NE-selection threshold
of 1 kG in MDI data; the exceptions are one or two cases in
which an NE candidate was removed because a long-lived or
especially large ephemeral region occurred at the center of
the target. We make the assumption that the rate and
distribution of ephemeral regions are the same between the
samples of the NE and PE populations, and propose that no
statistical bias is introduced due to the presence of
ephemeral regions.

4.3. Mis-Determination of Emergence Time

Numerous sources of error could lead to a
mis-determination of $t_0$, with effects presenting as bias or
as random error.

The coarse temporal resolution of the MDI data used here
could lead to a significant amount of "new" surface flux
being present for an hour or so before the "emergence
time" $t_0$. The limited spatial resolution of the MDI data
could lead to a significant amount of undetectable flux
being present for an unknown period before the
"emergence time" $t_0$. "Significant" is used here qualitatively,
because it is the lack of data which is the primary source
of the uncertainty itself. Lack of adequate sampling
should add an element of random noise to comparisons
between segments. The reliance on line-of-sight data,
however, may present a systematic late determination
of $t_0$ with respect to observing angle, since early flux
emergence is signaled by horizontal field (Zwaan 1985;
Zhang & Song 1992; Leka et al. 1996; Bernasconi et al.
2002; Kubo et al. 2003). In and of themselves, the instrumental
limitations should not present a statistical contamination
between the NE and PE samples. Since the presence of
surface field when none is expected (as due to the
mis-determination of $t_0$) may impact the Doppler signal and
hence the inferred helioseismic parameters, the results
for the time interval comprising the last hours prior to
$t_0$ will be interpreted with this uncertainty taken into
account (Birch et al. 2012; Barnes et al. 2012).

Of a more subtle nature, in terms of this study, is the
nature of flux emergence itself, the early evolution of
active regions, and whether or how a very young active
region is distinguishable from the general evolving
magnetic background. While we employed an objective and
quantitative method to determine $t_0$, as needed for a
statistical study, upon examination of any individual case
$t_0$ could be argued with. An example is shown in
Figure 3. An area of unchanging plage is co-spatial with the
eventual emergence of NOAA AR 9564, and episodes of
small bipoles appearing are evident prior to $t_0$ upon
detailed inspection. These bipoles would not gain attention
beyond the numerous ephemeral regions continuously
appearing on the surface (see Section 4.2), and indeed they
did not gain NOAA's attention, except that they were
located where NOAA AR 9564 eventually appeared.

We hypothesize, without further investigation, that the
pre-emergence surface field signature in the all-PE
averages (Figure 10) is an indication of this very common
characteristic: pre-emergence field can be present,
whether as remnant plage or very early emergence
episodes that are un-notable in any individual PE time
series. As commented on earlier, when only examples
are selected for which, by visual inspection, there is
no pre-emergence surface field, the pre-emergence field
signature is reduced if not absent. A third option is that
very early emerging flux is distributed and weak, and
detectable only on average (Figure 10) with the MDI
data due to the significantly reduced noise; in this case
the pre-emergent surface field signature is absent for the
clean subset not due to their "ultra-clean" nature, but
due to the smaller number of datasets being averaged,
and hence the increased noise (compared to averages for
all regions).

The impact of a varied flux-emergence rate on later
analysis should be a source of noise but not statistical
contamination. The rate of emergence of new flux was
cited in Ilonidis et al. (2011) as a key parameter relating
to the strength and timing of the pre-emergence signal.
However, it does not bias the NE vs. PE samples.

The final evaluation is that $t_0$ may be mis-determined
by an amount comparable to the MDI 96-min sampling,
hence the final time interval used for helioseismology
analysis will be assumed contaminated by early
emergence. Smaller episodes of new flux appearance are
indistinguishable from that which routinely occurs over the
solar disk without the subsequent formation of an active
region, and can simply be considered a source of noise
for the present analysis.

5. DISCUSSION

Local helioseismology rightfully holds promise as a
sensitive and powerful diagnostic of the solar subsurface
structure, evolution, and behavior. To interpret
the helioseismic signals with physical insight, they must
be isolated to those relevant to the events in question. To
fruitfully make use of the signals, the full extent of bias
and contamination must be understood. We have
designed a study to examine what signatures prior to the
appearance of solar active regions may be detected by
local helioseismology tools and data at this time, and
outlined the data selection criteria and preparation herein.

This study focuses on determining whether or not a
seismology signal is evident prior to emergence and what
its character might be. The goal is, as discussed in
Section 1, inferring changes in the subsurface associated with
active-region formation. Based on the preparation
described here, in Birch et al. (2012) we report on average
sub-surface properties of the two samples (PE and NE)
as derived using helioseismic holography, and find
statistically significant signatures in average subsurface flows
and wave speeds, but do not detect evidence of strong
spatially extended flows in the top 20 Mm during the day
preceding visible emergence. In Barnes et al. (2012),
parameters are derived from the seismology and magnetic
field to characterize each of the PE and NE regions, and
discriminant analysis is used to measure differences
between the sample sets. While statistically significant
differences are found from this analysis, it is found that no
single parameter can clearly distinguish a pre-emergence
from a non-emergence for any single region.

To mitigate sources of bias, the distributions of the 
samples are matched in location and time (epoch within 
the solar cycle). This approach is novel; however Pre- 
Emergence areas are targeted here exactly because they 
did form an active region significant enough to be no- 
ticed by NOAA. The PE targets can thus be studied 
with respect to the known location and time of emer- 
gence, and the magnetic- and seismology-based analy- 
sis performed with respect to the target's known coor- 
dinates. There is a random component in the selection 
of the No-Emergence regions, but they, too, are selected 
with knowledge that no emergence occurred within a spe- 
cific time interval. Hence, there is a bias in that we 
are pre-selecting targets for study according to what is 
known to have happened. 

There is an intrinsic difference between this study
design and any attempt at "forecasting" the emergence of
an active region. A forecasting study would instead be
required to sample all possible emergence sites and
compare the signals to all other possible sites, without a
priori knowledge aiding the analysis methods. At the very
least, a study designed for forecasting must employ
samples and statistics which reflect the prior probability of an
active region emerging at a randomly selected place and
time over the observable disk, which is extremely small.
While the results presented in Birch et al. (2012) and
Barnes et al. (2012), and the available "blind" datasets
(see Section 5.2 below), may serve to guide later studies
of the true forecasting ability of seismology for
active region appearance, we caution that study design and
attention to prior probabilities are crucial to answering
the specific questions posed.
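The role of the prior probability can be made concrete with Bayes'
rule. The sketch below is purely illustrative: the hit rate, the
false-alarm rate, and the priors are hypothetical numbers chosen
for the example and are not results of this study or of the
companion papers.

```python
def posterior_probability(prior, hit_rate, false_alarm_rate):
    """Probability of emergence given a positive classification.

    prior:            probability of emergence before any measurement
    hit_rate:         P(positive | emergence)
    false_alarm_rate: P(positive | no emergence)
    """
    p_positive = hit_rate * prior + false_alarm_rate * (1.0 - prior)
    return hit_rate * prior / p_positive


if __name__ == "__main__":
    # Balanced samples, as in a PE/NE comparison with equal numbers.
    print(posterior_probability(0.5, 0.8, 0.2))
    # A hypothetical forecasting setting with a very small prior.
    print(posterior_probability(0.001, 0.8, 0.2))
```

With equal sample sizes a classifier of this hypothetical quality
is right 80% of the time when it flags an emergence; with a prior
of 0.001 the same classifier is right in fewer than 1% of its
positive calls.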



A distinction between this study (along with Birch et al. 2012
and Barnes et al. 2012) and, e.g., Komm et al. (2009, 2011)
may be made: we focus on the period prior to any surface
field, whereas the earlier studies included both new and
growing active regions (with surface flux present). As such,
the present study may be seen as an extension of case-studies
which also focused on pre-emergence periods (e.g., Jensen et al.
2001; Ilonidis et al. 2011; Braun 2012) to statistically
significant sample sizes; however, the methods and
interpretive tools (depths, cadence, control samples if any)
differ between these studies and the present one. We
describe here the steps taken to acquire both a
statistically significant sample size and a clear focus on
pre-emergence phenomena (although see Section 5.1 below).
With better tools and analysis approaches, the
sometimes conflicting results in the literature should give
way; then, only those effects which are truly specific to
the emergence process will be the focus of discussion.

Any seismic changes detected prior to surface changes
will be evaluated in the context of the predictions made
by different theories covering the source and formation
mechanisms of solar active regions. But the seismology
is influenced by the early surface behavior, and the
interpretation of the surface behavior is influenced by our
understanding of the emerging-flux scenarios, which is what
we are trying to learn about using seismology. The
analysis has a circularity to it which implies one thing most
strongly: interpretation must be done with utmost care.
Only then can model predictions be validated.



Emergence scenarios differ between active regions with
respect to rate of flux increase, the existence of distinct
emergence episodes, location with respect to remnant
field, etc. As mentioned above, the early evolution of
active regions is an active research area and distinctly
tied to the sub-surface behavior which is the focus of
this study.

There are efforts underway (Martens et al. 2012) to
perform automatic feature recognition on data from, for
example, the instruments of SDO. Combining emergence
indications from HMI and AIA may be advantageous.
Using such a database of emergence times defined by an
independent algorithm may lend objectivity to the results
and ease the acquisition of the larger samples we suggest, but
it must be accompanied by research on the early
evolution of active regions.

5.1. For Future Studies 

Hindsight enables future improvement. The flaws of
a study design become distinctly clear as the study
progresses and "issues" arise; in the best situations the flaws
can be remedied, but in many cases, due to resource
limitations, corrections or accommodations must be made
mid-course. Specific effects that the flaws in the present
study's design had on the results will be discussed in
Birch et al. (2012) and Barnes et al. (2012), as appropriate.

Whereas this paper discusses the details of the design,
the results also comprise the lessons learned over the
duration of this study:

1. Characterizing early active-region appearance and
evolution is very much a research topic. The (objective,
independent) determination of emergence
time and location should be performed using, ideally,
vector magnetic field data to detect the earliest
horizontal field (Section 4.3); vector data may
thus alleviate any systematic bias in $t_0$ as a function
of observing angle. Resolution issues aside, the
early evolution of active regions may form a spectrum
of behavior such that assigning a single location
and time is, in fact, inappropriate. However,
for a statistical study, the determination of emergence
time must be performed, as we did here, in
an objective and repeatable manner, recognizing
that the answer is very sensitive to data resolution,
sensitivity, and cadence.

2. Data selection rules must be applied to areas used
in the final analysis with minimal discrepancies.
As described in Section 3, the initial evaluation of
the PE and NE regions was performed on a much
smaller field of view than was eventually extracted
from the GONG data (and than was also eventually
extracted from the MDI data for magnetic-field
comparisons). As such, there was more, and
more varied, peripheral activity than was expected
in both the NE and PE samples. The contamination
is inevitable given the large number of active
regions during solar maximum activity; still, the
bands of significant field in the periphery of the
NE average magnetograms (Figure 10) were
unexpected.

3. The distribution of background field must also be
matched between populations in a manner analogous
to matching the distributions of location and
time (epoch within a solar cycle). That is, there
is in fact a bias in the data sets used here, since
the NE regions are overall quieter, with less background
field, than the PEs (see Figure 11). Rather
than just select for "no field above a certain threshold"
for the NE regions, areas should be selected
which match the pre-emergence background field
distribution characteristics of the PEs. This task
is not trivial.

4. Related, the spatial distribution of the field may be 
important, since the seismology signatures are de- 
rived from Doppler signals both at the focal point 
and in an annulus, whose size relates to the depth 
sampled. Regions emerging into an existing plage 
area will have a different surrounding flux distri- 
bution than very-quiet non-emergence areas. Con- 
trariwise, if stable plage areas are chosen preferen- 
tially as the non-emergence targets, then a bias is 
clearly introduced. 

5. Utilize helioseismic data and magnetic data from 
the same source, if at all possible. It was unfor- 
tunate that magnetic field data were not readily 
available for this study from the GONG system. 
HMI is the logical data source for any follow-on 
statistical study to what is presented here. 

6. Examine 48 hr or more prior to emergence rather
than only 24 hr (and, of course, match this for the
control data). This will decrease the number of
regions available within suitable observing angles,
but will allow additional evolution to be detected.

7. For studies that employ statistical analysis, initial
target sample sizes should be 5-10 times larger
than assumed sufficient for the final analysis. The
robustness of results depends on noise in the data
and the many sources of bias. But it also involves
an interplay between sample size and the number
of variables tested. The larger the sample size, and
the larger that size is relative to the number of variables
under consideration, the smaller the chance
of statistical flukes in outcome. The initial "PE"
target list for this study numbered almost 500 regions;
after removing targets due to data problems
and significant spatial/temporal overlap, matching
for latitude/longitude/epoch, and accounting
for duty-cycle limitations, each time interval used
had ≈ 85-90 samples.
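Item 3 above calls for matching the background-field distribution
of the control sample to that of the PE sample. A minimal
one-variable sketch of such a match is given here, assuming a
Gaussian kernel density estimate with a rule-of-thumb bandwidth and
a greedy removal strategy. None of these choices is taken from the
matching algorithm used in this study, which operated on three
variables simultaneously; all function names are hypothetical.

```python
import numpy as np


def gaussian_kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate of 1-D samples on a grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = samples.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def mismatch(target, candidate, grid, bandwidth):
    """Integrated absolute difference between two density estimates."""
    diff = np.abs(gaussian_kde(target, grid, bandwidth)
                  - gaussian_kde(candidate, grid, bandwidth))
    # Trapezoid rule written out explicitly.
    return float(np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(grid)))


def greedy_match(target, pool, n_keep, n_grid=200):
    """Indices of n_keep pool members whose density best matches target.

    Repeatedly drops the pool member whose removal gives the lowest
    mismatch (a simple greedy heuristic, not a global optimum).
    """
    target = np.asarray(target, dtype=float)
    pool = np.asarray(pool, dtype=float)
    # Rule-of-thumb bandwidth based on the target sample (assumption).
    bandwidth = 1.06 * target.std(ddof=1) * target.size ** (-0.2)
    lo = min(target.min(), pool.min()) - 3.0 * bandwidth
    hi = max(target.max(), pool.max()) + 3.0 * bandwidth
    grid = np.linspace(lo, hi, n_grid)
    keep = list(range(pool.size))
    while len(keep) > n_keep:
        scores = [mismatch(target, pool[[j for j in keep if j != i]],
                           grid, bandwidth) for i in keep]
        keep.pop(int(np.argmin(scores)))
    return np.array(keep), grid, bandwidth
```

For several variables, one plausible extension is to sum the
mismatch over the variables before choosing which candidate to
drop; the cost of this greedy scheme grows quickly with the size of
the candidate pool.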

5.2. Data Availability 

Despite the shortcomings identified above, the present 
study provides a rich data set for investigating questions 
of pre-emergence signatures of solar active regions, the 
sensitivity of results to methodology, etc. 

To that end, we make the datasets used in this
study available through
http://www.cora.nwra.com/LWSPredictEmergence/Site/Data_Sets.html
(following the link which cites this paper). They have been
prepared for double-blind tests, in that the data from both
PE and NE samples are available but have been
randomized, with all identifying information removed from
filenames and file headers. Also included at that
website will be an uploadable form by which to submit
"answers" to the same Discriminant Analysis code used in
Barnes et al. (2012), so that groups interested in direct
method comparisons can quantitatively compare
performance against our published results.
Figure 1. A schematic showing the relative sizes of areas considered during the data preparation and analysis. The solar Stonyhurst disk is shown with lines at 10° latitude and longitude intervals ( ). The black box indicates the 128 × 100-pixel area of an MDI image used for the initial evaluation, in this case centered at N30 W30. The larger box (blue) is a Postel-projection region of 32° × 32°, showing the area extracted for the tracked Doppler data from GONG, and the corresponding area of the computed radial component of the field from MDI extracted for the full analysis. The red circles indicate the size and width of the largest annulus (filter "TD11") used for computing helioseismology parameters (see Birch et al. 2012).
Figure 2. An example of a Pre-Emergence target, NOAA AR 10559, which had an assigned emergence time $t_0$ of 2004-02-13T11:15:02.677Z (see text for details) at N07 W22.4. The images (a-e) are the 128 × 100-pixel images from the MDI full-disk line-of-sight magnetic data used for initial evaluation of the emergence episode, all scaled to ±500 G. The image (c) shows the assigned "emergence time" $t_0$. The temporal evolution of the pseudo-flux $\Phi_{\rm los} = \sum |B_{\rm los}|/\mu \, \Delta A$ for this test field of view is shown in (f), as a function of time relative to the inferred time of emergence, determined as the first MDI magnetogram when 10% of the eventual maximum change in flux, $\delta\Phi_{\rm los}$, has appeared (c). Data points for the images shown are filled in and labeled; a mix of 30 s and 300 s MDI data are both used and shown here, evident by the different apparent noise levels.
Figure 3. An example of a Pre-Emergence target with a less-clear emergence time, NOAA AR 9564, which had an assigned emergence time $t_0$ of 2001-08-01T06:27:01.250Z (see text for details) at N14 W21.2. The images (a-e) are in the same format as in Figure 2. The temporal evolution of $\Phi_{\rm los}$ is shown in (f), except that the maximum of the region attained is truncated to better show the early evolution. Data points for the images shown are filled in and labeled. In this case, the "background" $\Phi_{\rm los} = 0.96 \times 10^{21}$ Mx is larger than in the previous example. The region eventually reached $9.5 \times 10^{21}$ Mx, or a maximum increase of $8.6 \times 10^{21}$ Mx, hence point (d) at $1.8 \times 10^{21}$ Mx was the identified emergence time by the objective algorithm. However, it is clear that a small episode of flux emergence apparently occurred between (b) and (c) as well. While in this case we can argue a 6 hr uncertainty in the emergence times, there was very little if any surface signal of the emergence for many hours prior to (c).
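The threshold arithmetic quoted in the captions of Figures 2 and 3
(the first magnetogram at which 10% of the eventual maximum change
in flux has appeared) can be sketched as follows. How the
background level is estimated is not specified in the captions, so
it is passed in explicitly here; the function name and the sample
time series are hypothetical.

```python
import numpy as np


def emergence_index(flux, background, frac=0.10):
    """Index of the first sample where flux >= background + frac * max change.

    flux:       time series of pseudo-flux (any consistent unit)
    background: pre-emergence flux level (assumed known)
    Returns None if the flux never rises above the background.
    """
    flux = np.asarray(flux, dtype=float)
    max_change = flux.max() - background
    if max_change <= 0:
        return None
    threshold = background + frac * max_change
    above = np.nonzero(flux >= threshold)[0]
    return int(above[0])


if __name__ == "__main__":
    # Hypothetical series in units of 1e21 Mx, loosely patterned on
    # the Figure 3 values (background 0.96, maximum 9.5).
    series = [0.96, 1.0, 1.3, 1.9, 4.0, 9.5]
    print(emergence_index(series, background=0.96))
```

For the Figure 3 numbers, a background of 0.96 plus 10% of an
increase of 8.6 gives a threshold near 1.8 (in units of
10^21 Mx), consistent with the value quoted for point (d).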
Figure 4. Histogram of the maximum size of the sunspot group attained during disk visibility of the emerging active regions, in μH (micro-hemispheres) as reported by the NOAA active region lists. The minimum reported size is 10 μH; the largest included in this sample was 930 μH.
Figure 5. The selection area and eventual data-extraction area of a non-emergence target. Three days of MDI 96-minute data beginning with MDI orbit #4053 (2004-02-06T00:03:02.469Z) have been averaged together, and are shown here scaled to ±100 G. The black box shows the coverage of a 128 × 100-pixel box tracked over the three days, indicating the entire quiet or "NonEmerging" ("NE") area that consistently has only signal < 1000 G over the three days. The white box indicates the area of tracked GONG and MDI data eventually used for the full analysis, discussed in Sections 3.4 and 3.5.
Figure 6. The Non-Emergence target from Figure 5, which had an assigned center-time $t_0$ of 2004-02-07T08:03:02.501Z (see text for details) at S18.6 E17.4. The images (a-e) are the 128 × 100-pixel images from the MDI full-disk line-of-sight magnetic data used for initial evaluation of the emergence episode, all scaled to ±500 G. The temporal evolution (f) of the pseudo-flux $\sum |B_{\rm los}|/\mu \, \Delta A$ for this test field of view is shown as a function of time relative to the inferred time of emergence, scaled to match Figure 2. Data points for the images shown are filled in and labeled.
Figure 7. Distributions of (left) latitude, (center) heliographic longitude and (right) date at $t_0$, the defined emergence time. Shown are the PE distributions (red), the larger sample of NE data (black, dotted) from which the matching algorithm drew the final sample (black, solid) which minimized the integrated difference between the PE and NE Non-Parametric Density Estimates of the three quantities simultaneously. Top row: histograms of the relevant quantities, hence indicating number in each bin; Bottom row: the NPDE distributions, on which the minimization was performed. The 1-D matches (one variable at a time) are shown here, whereas the optimization was performed on all three variables simultaneously. Hence, while better 1-D matches may certainly be obtainable, it would be at the cost of the 3-D match results.
Figure 8. Top: The 32° × 32° radial field image, matched to the GONG data area, for the same targets as Figures 2 and 6. Axes are shown in both Mm and degrees from the center tangent point. Left: average of the PE target AR 10559, 2004 February 13 23:59 - 2004 February 14 06:23; Right: average NE-target field, 2004 February 07 01:35-08:03 UT. Bottom: average GONG Doppler images for the same targets, for 384 minutes each (the length of an interval used for the helioseismology analysis): Left, for PE target AR 10559, for the same interval as the magnetogram average above, and Right: for the NE target, and the same interval. All images: grey boxes indicate the approximate area used for initial diagnostics (as in Figures 2 and 6), for reference and as an explanation of the presence of significant magnetic flux, for example, in many NE targets.
Figure 9. A schematic which demonstrates the temporal relationship of the time intervals. The dotted line represents the time-series of GONG Doppler data, bold-face numbers across the top are in minutes; the five time intervals are labeled "TI-#", and the central time of each interval, in hours relative to the end of the GONG data, is indicated below its label. The GONG data run 1664 minutes, and end 16 minutes after the emergence time determined as described in the text. Intervals start every 320 minutes, are 384 minutes long, and overlap with neighboring intervals by 64 minutes. This schematic applies to both PE and NE data, albeit with a "fake" $t_0$ for the NE targets which corresponds instead to exactly the end of the GONG data.
Figure 10. Averages over all samples of the unsigned radial field for each of the time intervals, as accompanies the seismology data. In addition to the five primary intervals prior to emergence, we show here the averages for two additional intervals post-emergence, for comparison. Times indicate the central time of each interval, following Figure 9. All figures use the same grey-scale. Top: the NE samples, Middle: the PE samples, Bottom: the "Ultra-Clean" subset of PE samples.
Figure 11. Median of the area-averaged unsigned field and the errors in the median (using a bootstrap method), for both the PE (red) and NE (black) data, plotted as a function of the central time of the intervals relative to the end of the GONG day (which is effectively $t_0$). Larger symbols (and muted red/grey) indicate that the median was taken over the entire extracted area; smaller (red/black) symbols indicate that only the smaller ≈ 16° × 16° area used for the helioseismology analysis was included.
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Table 1 

Identifying Coordinates for Pre-Emergence Targets.

Region    Emergence    Location (°)    Max Size
Table 2 

Identifying Coordinates for No-Emergence Targets. 

Region ID       GONG-Day           Ref. Location (°)
(MDI Orbits)    Reference Date       Lat.   Long.

?               2002-03-27 16:48     -4.4   -35.0
3405-3408       2002-04-29 18:23    -24.1   -25.2
3415-3418       2002-05-10 00:47      5.6    11.1
3418-3421       2002-05-13 02:22     15.6   -19.2
3430-3433       2002-05-25 02:24     -2.3    -6.0
3455-3458       2002-06-18 16:51     -8.9   -17.2
3456-3459       2002-06-20 08:48    -21.1   -13.0
3471-3474       2002-07-05 02:27     14.5    -2.6
3472-3475       2002-07-05 13:35     12.0   -23.1
3479-3482       2002-07-13 08:47     -1.4     3.0
3484-3487       2002-07-18 02:23    -29.0     2.8
3502-3505       2002-08-04 18:23    -21.9   -10.7
3508-3511       2002-08-11 00:47    -11.4   -21.4
3519-3522       2002-08-22 02:27      3.7     0.4
3535-3538       2002-09-07 05:35     -8.8   -36.2
3547-3550       2002-09-19 02:23     12.9   -29.0
3555-3558       2002-09-27 00:48    -13.7   -33.2
3555-3558       2002-09-27 02:24     15.3     7.9
3564-3567       2002-10-05 11:59     -6.0    -0.1
3565-3568       2002-10-07 00:51    -18.7   -22.9
3597-3600       2002-11-08 07:11    -29.4   -32.7
3600-3603       2002-11-11 10:24     -4.5   -12.8
3607-3610       2002-11-17 16:51     11.0   -17.7
3607-3610       2002-11-18 07:15      6.2     7.3
3611-3614       2002-11-22 05:36     10.4     8.5
3615-3618       2002-11-26 02:27    -13.2    -2.0
3636-3639       2002-12-17 04:03      2.8   -35.9
3646-3649       2002-12-27 08:48    -12.5    -2.7
3683-3686       2003-02-02 11:59     17.1   -27.6
3694-3697       2003-02-13 05:35     11.2    -6.7
3703-3706       2003-02-21 18:26     -7.7     5.9
3703-3706       2003-02-21 20:02    -10.1   -11.2
3753-3756       2003-04-12 18:23     -6.6    -8.5
3758-3761       2003-04-18 08:47     -5.9   -25.4
3780-3783       2003-05-10 07:11    -15.8     9.0
3782-3785       2003-05-11 18:22    -13.1   -27.2
3789-3792       2003-05-19 05:35     -5.4    -7.5
3795-3798       2003-05-24 19:59     24.6    -3.6
3802-3805       2003-05-31 11:59    -20.7   -33.0
3810-3813       2003-06-09 08:47    -27.6    12.6
3855-3858       2003-07-24 02:22     23.2   -27.7
3863-3866       2003-07-31 13:35     -7.6    17.9
3874-3877       2003-08-12 02:27      7.9     1.0
3893-3896       2003-08-30 15:15     10.3   -32.5
3910-3913       2003-09-16 21:36     15.7    -4.5
3969-3972       2003-11-14 23:11     18.4    11.1
3983-3986       2003-11-28 21:39      ?.3   -31.1
3997-4000       2003-12-12 15:15    -12.9     1.5
4043-4046       2004-01-28 00:47     -8.4    16.4
4044-4047       2004-01-29 07:11     14.5   -17.5
4053-4056       2004-02-06 20:03    -18.6   -23.9
4067-4070       2004-02-20 15:12    -16.7    13.1
4073-4076       2004-02-26 16:47     15.8   -16.9
4084-4087       2004-03-08 21:39      1.6   -25.9
4138-4141       2004-05-01 13:35    -11.8   -19.7
4157-4160       2004-05-20 18:21     -7.3   -30.5
4176-4179       2004-06-08 13:39     21.4    11.2
4217-4220       2004-07-19 16:47      6.4    -7.0
4228-4231       2004-07-30 23:11    -17.?    -5.9
4256-4259       2004-08-27 15:11    -10.7   -16.2
4327-4330       2004-11-07 10:23    -13.6   -16.2
4334-4337       2004-11-13 13:35     -7.4    20.1
4404-4407       2005-01-23 00:47     -6.2     2.3
4441-4444       2005-02-28 15:15      3.6    -4.9
4518-4521       2005-05-17 07:12     -3.?   -29.1
4551-4554       2005-06-19 00:51     10.1   -24.2
4580-4583       2005-07-17 19:59     10.8     2.5
4604-4607       2005-08-10 13:35     10.2   -13.0
4623-4626       2005-08-29 15:11     -8.5   -25.2
4628-4631       2005-09-04 04:03     13.3   -17.5
4669-4672       2005-10-15 07:12     12.0   -18.5
4676-4679       2005-10-22 08:48     -5.2   -27.7
4733-4736       2005-12-18 07:12     11.8   -33.0
4833-4836       2006-03-28 02:23     -7.5    -9.5
4834-4837       2006-03-28 16:51    -11.8   -25.7
4913-4916       2006-06-15 21:35    -14.2   -23.1
4916-4919       2006-06-19 10:23    -23.9    -3.2
4929-4932       2006-07-02 05:36    -10.1    19.2
4955-4958       2006-07-27 20:03    -12.7    14.9
4959-4962       2006-08-01 00:51     -9.7   -20.4
5037-5040       2006-10-18 05:36      9.4     6.6
5113-5116       2007-01-02 00:51     -1.6     6.5
5233-5236       2007-05-01 12:03      5.?    22.2
5277-5280       2007-06-15 05:36     -2.2    -8.4
5335-5338       2007-08-11 12:03     24.7    -6.2
5374-5377       2007-09-19 19:59      8.6   -19.4
5384-5387       2007-09-30 02:23    -10.8     5.4
5397-5400       2007-10-12 15:11     -4.?    -1.4
5411-5414       2007-10-26 18:24     -3.5   -35.7
5472-5475       2007-12-26 18:27    -12.2   -34.7
Leka et al. 



Table 3 

Duty Cycle for NE, PE targets 

Time Interval    NE samples    PE samples    Ultra-Clean PE samples
TI-0             81            89            7
TI-1             85            88            10
TI-2             85            89            11
TI-3             82            87            9
TI-4             83            86            9



